2018.10.16

Potentially relevant bits of my story

  • Very lazy software engineer; like to automate things
  • Worked with some Expert Systems folks for a couple years at a place known for automated decision-making
  • The inability of these systems to deal with uncertainty bothered me to no end
  • Took time off to do Recurse Center and dive in
  • Realized the majority of people trying to solve problems like this were actaully doing Bayesian modeling
  • Joined Stan Core Development team

How do we make decisions?

  • Try to understand the state of the world
  • Guess at the processes through which the world reacts to stimuli
  • Associate utility scores with possible world states, and model relative chance
  • Often, communication of findings is key

Why is it hard to assign utilities to actions?

  • Hard to represent the state of our knowledge about the world and how it changes
    • We have uncertainty and ambiguity around even the most deterministic of processes
    • And limited resources to discover the true process
  • Many possible models for a process
    • We will never know the True Model
    • Nor the world’s True State

Concrete example - mail delivery

  • I get a few items of mail most days, but some days I get 0
  • How often should I be experiencing no-mail days?
  • I could use statistical analysis to make a case to a reporter that it is worth investigating
  • …imagine that I care a lot about the mail

Mail delivery uncertainty

  • Mail is easy to measure, but the process that generates mail is incredibly complex
    • For example, in a moment of weakness I used some expiring airline miles to subscribe to Vanity Fair
    • No one has time to model my weaknesses
  • Assume that the rate of mail creation is constant over time

The Poisson distribution

Modeling mail delivery with a Poisson

  • A Poisson distribution captures much of the behavior we care about, but
    • Exactly how well does it describe our data?
    • Want to keep track of our uncertainty when making predictions or decisions
    • We suspect someone is messing with our Poisson
  • Visually explain to our reporter friend in terms they find compelling

Frequentism, unfairly summarized

  • Lots of totally valid frequentist math and procedures
    • Typically quite difficult
  • Instead, set up your experiment correctly and then use some off-the-shelf test or estimator to get The Best Answer (point estimate)
    • Oh, and make sure this list of assumptions all hold
  • Alternatively, do a ton of math to prove asymptotic gaurantees specific to your model
  • No parameter uncertainty, because probability only represents long-run frequency
    • This actually matters when thinking about and communicating results
  • “No” priors; incorporate minimal expert knowledge

Machine learning, fairly summarized

  • We have computers now, calculus can GTFO
    • fair enough
  • Great success with algorithms that have only later had statistical interpretations ascribed to them
  • Progress = predictions 0.01% better than the last paper according to cross validation
    • Let’s ignore that the noise around cross-validated accuracy scores is much higher than 0.01%
  • Why would we need to interpret the model if its predictions are accurate?

What does Bayesianism preach?

  • Philosophically, inference without a prior is impossible
    • In frequentism it’s buried in an assumption
    • In ML you are not really sure if there is a deeper meaning under the code in the first place, but there is some implied prior
  • Retain a distribution (as opposed to point estimates like the mean) as long as possible before making a decision
  • We also (only recently?) have computers!
    • We can actually express an accurate, bespoke model for our data generating process without worrying about an analytic integral
  • Construct many models
  • Always Be Graphing

What are we modeling?

  • “If we knew our parameter values \(\theta\), how would data \(y\) be generated?” \[\pi(y \, | \, \theta)\]
  • But we want to learn what our data implies about \(\theta\)
  • \[\pi(\theta \, | \, y) = \frac{\pi(y \, | \, \theta) \pi(\theta)}{\pi(y)}\]
  • In our case, \(\theta\) might be the average amount of mail we get on any given day, or the percentage of days the carrier skips

Practical modeling with Stan

  • Stan is an imperative probabilistic programming language
    • in the vein of BUGS, but can do arbitrary computation
  • A Stan program defines a probability model
    • declares data inputs and parameter variables
    • defines a log posterior density function
  • Stan comes with a few inference algorithms
    • Hamiltonian Monte Carlo for full Bayesian inference
    • Variational Inference
    • penalized Maximum Likelihood Estimation and optimization

Why choose Stan?

  • Expressive
    • Stan is a full imperative programming language
    • Easier to communicate models
  • Robust
    • usually works; signals when it doesn’t
  • Efficient
    • effective sample size per unit time winner
    • 3 forms of parallelism coming out this year (CPU, GPU, cluster)
  • Great documentation, case studies, and community!

Anatomy of a Stan model

  • data and parameters just declare inputs to a density function
  • model block defines that function and, given some data and parameters, returns a scalar real value density of the posterior at that location in parameter space
  • This defines \(\pi(y \, | \, \theta)\) from a few slides ago
  • Stan provides inference mechanisms to learn \(\pi(\theta \, | \, y)\) from a model’s \(\pi(y \, | \, \theta)\)

Normal mail delivery Stan model

data {
  int<lower=0> num_days;
  int<lower=0> mail[num_days];
}
parameters {
  real<lower=0> avg_mail_per_day;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  mail ~ poisson(avg_mail_per_day);
}

Fitting a Stan model with PyStan

import pystan, numpy as np
data = dict(mail = np.array(r["mail"], dtype="int"), num_days=int(r["num_days"]))
model = pystan.StanModel("usps1.stan")
fit = model.sampling(data)
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd   2.5%      50%  97.5%  n_eff   Rhat
avg_mail_per_day   4.96  2.3e-3   0.08    4.8    4.96   5.11   1316    1.0

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Plotting the fit

from matplotlib import pyplot as plt
plt.hist(fit.extract()["avg_mail_per_day"], bins=20)

Visually communicating model fit

from matplotlib import pyplot as plt
mail_rate = fit.extract()["avg_mail_per_day"] # 4000 posterior samples for this param
fake_data = np.random.poisson(mail_rate) #4000 fake data simulations
plt.hist([fake_data, data["mail"]], bins=15, density=True, color=["red", "blue"])
  • Posterior Predictive Check (PPC) plot
  • Posterior distribution is a distribution over parameter space incorporating what we’ve learned about our parameters
  • Simulate data based on the posterior and plot it against the actual observed data

Visually communicating model fit

from matplotlib import pyplot as plt
mail_rate = fit.extract()["avg_mail_per_day"] # 4000 posterior samples for this param
fake_data = np.random.poisson(mail_rate) #4000 fake data simulations
plt.hist([fake_data, data["mail"]], bins=15, density=True, color=["red", "blue"])

Add skip_percentage to model skipped days

data {
  int<lower=0> num_days;
  int<lower=0> mail[num_days];
}
parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += without_skipping_density;
  }
}

Fit model with skip_percentage

Inference for Stan model: anon_model_7abb4b18761050015bf05aee642e2a30.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd   2.5%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.77  2.0e-3   0.09   5.58   5.76   5.96   2328    1.0
skip_percentage    0.99  2.2e-4   0.01   0.96   0.99    1.0   2791    1.0
lp__              -1599    0.03   1.06  -1602  -1598  -1598   1336    1.0

Samples were drawn using NUTS at Tue Oct 16 22:41:45 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Simulate data implied by parameter draws

mail_rate15 = fit15.extract()["avg_mail_per_day"]
skip_percentage = fit15.extract()["skip_percentage"]
fake_data15 = np.zeros(len(mail_rate))
for i in range(len(mail_rate15)):
    todays_mail =  np.random.poisson(mail_rate15[i])
    if np.random.binomial(1, skip_percentage[i]):
        fake_data15[i] = 0
    else:
        fake_data15[i] = todays_mail

Visually compare simulations with observed data

Visually compare simulations with observed data


Fix our math; intro log scales

parameters {
  real<lower=0> avg_mail_per_day;
  real<lower=0, upper=1> skip_percentage;
}
model {
  avg_mail_per_day ~ normal(0, 5);
  for (day in 1:num_days) {
    real without_skipping_density = poisson_lpmf(mail[day] | avg_mail_per_day);
    if (mail[day] == 0)
      target += log_mix(skip_percentage, 0, without_skipping_density);
    else
      target += without_skipping_density
                + bernoulli_lpmf(0 | skip_percentage);
  }
}

Fixed fit

Inference for Stan model: anon_model_387876dd9af8a07353d37466d2bde6fb.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean     sd    25%    50%  97.5%  n_eff   Rhat
avg_mail_per_day   5.52  1.6e-3   0.09   5.46   5.52    5.7   3350    1.0
skip_percentage     0.1  2.0e-4   0.01    0.1    0.1   0.13   3072    1.0
lp__              -1821    0.02   0.97  -1821  -1821  -1820   1884    1.0

Samples were drawn using NUTS at Sat Oct 13 13:45:07 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

PPC plot for corrected model

PPC plot for corrected model


Pure poisson (left) vs. with skipped days (right)


A typical non-Bayesian approach

  • Plug the model (sans prior) and data into an estimator like maximum likelihood (w/ implied uniform prior)
  • This means using an optimizer to find the mode (maximum) of the distribution, and then
  • (Frequentist) Perform some test to get a p-value against some “null hypothesis” to argue for their arbitrary alternative
  • (Machine Learning) Show that with certain initial values, cross-validation scores on test set were higher at least one time

What is wrong with optimization and point estimates?

Area in higher dimensions is weird

Curse of dimensionality

Curse of dimensionality

  • “Somewhat counterintuitively, the average member of a population is an outlier. How could an item with every feature being average be unusual? Precisely because it is unusual to have so many features that close to average.” -Carpenter
  • Recall the Airforce’s attempt to design a cockpit for the average pilot

Curse of dimensionality


(from Mackay’s book)

How Stan’s sampling differs from optimizing

  • Hamiltonian Monte Carlo (HMC) draws parameters from the typical set
  • Scales appropriately with dimension
  • No U-Turn Sampler (NUTS) provides an adaptation routine on top of HMC

How does HMC work?

Bayesian methods are intuitive and general

  • Retain uncertainty up to the point of a decision
  • Capture more interesting data generating processes and relationships
  • Understand how your model works
    • enables fairness, accountability, and transparency
  • Can intuitively use draws to answer ad-hoc questions about inferences
    • “Given this model, in what percentage of worlds is the mail carrier skipping >0 days”
    • Replaces frequentist statistical tests
  • Sensitivity analysis (“What could this model/data tell me?”)
    • Tomorrow in Betancourt’s class, in the case studies, or our course in November!

Additional resources